Executive Summary
CytoAtlas is a comprehensive computational resource that maps cytokine and secreted protein signaling activity across 29 million human cells from four independent datasets spanning healthy donors, inflammatory diseases, cancers, and cytokine perturbations. The system uses linear ridge regression against experimentally derived signature matrices to infer activity — producing fully interpretable, conditional z-scores rather than black-box predictions.
Key results:
- 1,213 signatures (43 CytoSig + 1,170 SecAct), plus 178 cell-type-specific LinCytoSig variants, validated across 4 independent atlases
- Spearman correlations reach ρ=0.6–0.9 for well-characterized cytokines (IL1B, TNFA, VEGFA, TGFB family)
- Cross-atlas consistency demonstrates signatures generalize across CIMA, Inflammation Atlas, scAtlas, GTEx, and TCGA
- SecAct achieves the highest correlations in bulk & organ-level analyses (median ρ=0.40 in GTEx/TCGA)
1. System Architecture and Design Rationale
1.1 Why This Architecture?
CytoAtlas was designed around three principles that distinguish it from typical bioinformatics databases:
Principle 1: Linear interpretability over complex models.
Ridge regression (L2-regularized linear regression) was chosen deliberately over methods like autoencoders, graph neural networks, or foundation models. The resulting activity z-scores are conditional on the specific genes in the signature matrix, meaning every prediction can be traced to a weighted combination of known gene responses.
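The idea of tracing each activity score back to a weighted combination of signature genes can be illustrated with a minimal NumPy sketch. It is not the `secactpy.ridge` implementation: the function name, the gene-shuffling permutation null, and the synthetic data are assumptions made for illustration, and only the penalty λ=5×10⁵ is taken from this report.

```python
import numpy as np

def ridge_activity_zscores(signatures, expression, lam=5e5, n_perm=200, seed=0):
    """Ridge activity with a permutation null (illustrative sketch).

    signatures: (genes x targets) signature matrix X
    expression: (genes x samples) expression matrix Y
    Returns z-scores of shape (targets x samples).
    """
    rng = np.random.default_rng(seed)
    n_targets = signatures.shape[1]
    # Closed-form ridge projection: (X^T X + lam * I)^-1 X^T
    gram = signatures.T @ signatures + lam * np.eye(n_targets)
    projection = np.linalg.solve(gram, signatures.T)
    beta = projection @ expression
    # Null distribution (assumption): shuffle the gene order of Y and refit
    null = np.empty((n_perm,) + beta.shape)
    for i in range(n_perm):
        shuffled = expression[rng.permutation(expression.shape[0])]
        null[i] = projection @ shuffled
    # z-score of the observed coefficient against its permutation null
    return (beta - null.mean(axis=0)) / null.std(axis=0)

# Synthetic check: the single sample is driven by target 0 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
Y = 2.0 * X[:, [0]] + rng.normal(size=(500, 1))
z = ridge_activity_zscores(X, Y)
print(z.ravel())
```

Because the projection matrix is fixed once the signature matrix is chosen, each z-score depends only on the genes in that matrix, which is the sense in which the scores are conditional on the signature.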
Principle 2: Multi-level validation at every aggregation.
CytoAtlas validates at five levels: donor-level pseudobulk, donor × cell-type pseudobulk, single-cell, bulk RNA-seq (GTEx/TCGA), and bootstrap resampling with confidence intervals.
Principle 3: Reproducibility through separation of concerns.
| Component | Technology | Purpose |
|---|---|---|
| Pipeline | Python + CuPy (GPU) | Activity inference, 10–34x speedup |
| Storage | DuckDB (3 databases, 68 tables) | Columnar analytics, no server needed |
| API | FastAPI (262 endpoints) | RESTful data access, caching, auth |
| Frontend | React 19 + TypeScript | Interactive exploration (12 pages) |
1.2 Processing Scale
| Dataset | Cells | Samples | Time (wall clock) | GPU |
|---|---|---|---|---|
| CIMA | 6.5M | 421 donors | ~2h | A100 |
| Inflammation Atlas | 6.3M | 1,047 samples | ~2h | A100 |
| scAtlas | 6.4M | 781 donors | ~2h | A100 |
| parse_10M | 9.7M | 1,092 conditions | ~3h | A100 |
Time = wall-clock time for full activity inference (ridge regression across all signatures) on a single NVIDIA A100 GPU (80 GB VRAM).
2. Dataset Catalog
2.1 Datasets and Scale
| # | Dataset | Cells | Donors/Samples | Cell Types | Reference |
|---|---|---|---|---|---|
| 1 | CIMA | 6,484,974 | 421 donors | 27 L2 / 100+ L3 | J. Yin et al., Science, 2026 |
| 2 | Inflammation Atlas | 6,340,934 | 1,047 samples | 66+ | Jimenez-Gracia et al., Nature Medicine, 2026 |
| 3 | scAtlas | 6,440,926 | 781 donors | 100+ | Q. Shi et al., Nature, 2025 |
| 4 | parse_10M | 9,697,974 | 12 donors × 91 cytokines | 18 PBMC types | Oesinghaus et al., bioRxiv, 2026 |
2.2 Disease and Condition Categories
CIMA (421 healthy donors): Healthy population atlas with paired blood biochemistry (19 markers: ALT, AST, glucose, lipid panel, etc.) and plasma metabolomics (1,549 features). Enables age, BMI, sex, and smoking correlations with cytokine activity.
Inflammation Atlas (20 diseases): RA, SLE, Sjogren's, PSA, Crohn's, UC, COVID-19, Sepsis, HIV, HBV, BRCA, CRC, HNSCC, NPC, COPD, Cirrhosis, MS, Asthma, Atopic Dermatitis
scAtlas: Normal (35+ organs) + Cancer (15+ types: LUAD, CRC, BRCA, LIHC, PAAD, KIRC, OV, SKCM, GBM, etc.)
parse_10M: 90 cytokines × 12 donors, an independent in vitro perturbation dataset used for comparison. Most of the cytokines (~58%) are produced in E. coli, with the remainder from insect (Sf21, 12%) and mammalian (CHO, NS0, HEK293, ~30%) expression systems. Because exogenous perturbagens may induce effects that differ from those of endogenously produced cytokines, parse_10M serves as an independent comparison rather than strict ground truth. CytoSig and SecAct have a potential advantage in this regard, as they infer relationships directly from physiologically relevant samples.
2.3 Signature Matrices
| Matrix | Targets | Construction | Reference |
|---|---|---|---|
| CytoSig | 43 cytokines | Median log2FC across all experimental bulk RNA-seq | Jiang et al., Nature Methods, 2021 |
| LinCytoSig | 178 (45 cell types × 1–13 cytokines) | Cell-type-stratified median from CytoSig database (methodology) | This work |
| SecAct | 1,170 secreted proteins | Median global Moran's I across 1,000 Visium datasets | Ru et al., Nature Methods, 2026 (in press) |
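As a rough illustration of the "median log2FC" construction listed for CytoSig, the sketch below collapses several treated-versus-control experiments into one signature value per gene. The function name, the pseudocount of 1, and the toy expression values are assumptions for illustration and are not taken from the CytoSig pipeline.

```python
import numpy as np

def median_log2fc_signature(treated, control, pseudocount=1.0):
    """Collapse experiments into one signature value per gene.

    treated, control: arrays of shape (experiments x genes), paired by row.
    The pseudocount is an assumption to keep the ratio finite at zero expression.
    """
    log2fc = np.log2((treated + pseudocount) / (control + pseudocount))
    # Median across experiments is robust to a single outlying experiment
    return np.median(log2fc, axis=0)

# Toy data: 3 experiments x 3 genes; only gene 0 responds to the cytokine.
treated = np.array([[40.0, 10.0, 5.0],
                    [80.0, 12.0, 4.0],
                    [35.0,  9.0, 6.0]])
control = np.array([[10.0, 10.0, 5.0],
                    [10.0, 11.0, 8.0],
                    [ 8.0, 10.0, 6.0]])
signature = median_log2fc_signature(treated, control)
print(signature)  # gene 0 has median log2FC 2.0; genes 1 and 2 have 0.0
```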
3. Scientific Value Proposition
3.1 What Makes CytoAtlas Different from Deep Learning Approaches?
Most single-cell analysis tools use complex models (VAEs, GNNs, transformers) that produce aggregated, non-linear representations difficult to interpret biologically. CytoAtlas takes the opposite approach:
| Property | CytoAtlas (Ridge Regression) | Typical DL Approach |
|---|---|---|
| Model | Linear (z = Xβ + ε) | Non-linear (multi-layer NN) |
| Interpretability | Every gene's contribution is a coefficient | Feature importance approximated post-hoc |
| Conditionality | Activity conditional on specific gene set | Latent space mixes all features |
| Confidence | Permutation-based z-scores with CI | Often point estimates only |
| Generalization | Tested across 6 independent cohorts | Often held-out splits of same cohort |
| Bias | Transparent — limited by signature matrix genes | Hidden in architecture and training data |
The key insight: CytoAtlas is not trying to replace DL-based tools. It provides an orthogonal, complementary signal that a human scientist can directly inspect. When CytoAtlas says "IFNG activity is elevated in CD8+ T cells from RA patients," you can verify this by checking the IFNG signature genes in those cells.
3.2 What Scientific Questions Does CytoAtlas Answer?
- Which cytokines are active in which cell types across diseases? — IL1B/TNFA in monocytes/macrophages, IFNG in CD8+ T and NK cells, IL17A in Th17, VEGFA in endothelial/tumor cells, TGFB family in stromal cells — quantified across 20 diseases, 35 organs, and 15 cancer types.
- Are cytokine activities consistent across independent cohorts? — Yes. IL1B, TNFA, VEGFA, and TGFB family show consistent positive correlations across all 6 validation atlases (Figure 6).
- Does cell-type-specific biology matter for cytokine inference? — For select immune types, yes: LinCytoSig improves prediction for Basophils (+0.21 Δρ), NK cells (+0.19), and DCs (+0.18), but global CytoSig wins overall (Figures 9–10).
- Which secreted proteins beyond cytokines show validated activity? — SecAct (1,170 targets) achieves the highest correlations across all atlases (median ρ=0.33–0.49), with novel validated targets like Activin A (ρ=0.98), CXCL12 (ρ=0.92), and BMP family (Figure 11).
- Can we predict treatment response from cytokine activity? — We are incorporating cytokine-blocking therapy outcomes from bulk RNA-seq to test whether predicted cytokine activity associates with therapy response. Additionally, Inflammation Atlas responder/non-responder labels enable treatment response prediction using cytokine activity profiles as features.
3.3 Validation Philosophy
CytoAtlas validates against a simple but powerful principle: if CytoSig predicts high IFNG activity for a sample, that sample should have high IFNG gene expression. This expression-activity correlation is computed via Spearman rank correlation across donors/samples.
This is a conservative validation — it only captures signatures where the target gene itself is expressed. Signatures that act through downstream effectors would not be captured, meaning our validation underestimates true accuracy.
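The expression-activity check described above reduces to a rank correlation across donors. The sketch below computes Spearman ρ with NumPy only (in practice `scipy.stats.spearmanr` gives the same statistic); the helper names and the five donor values are made up for illustration.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(activity, expression):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(activity),
                             average_ranks(expression))[0, 1])

# Hypothetical donor-level values for one target (e.g. IFNG)
predicted_activity = [0.1, 0.5, 1.2, 2.0, 3.1]   # ridge z-scores per donor
target_expression = [1.0, 3.0, 2.0, 8.0, 9.0]    # target gene expression per donor
rho = spearman_rho(predicted_activity, target_expression)
print(round(rho, 3))  # 0.9
```

A positive ρ here means donors with higher predicted activity also tend to express more of the target gene, which is exactly what the validation looks for.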
4. Validation Results
4.1 Overall Performance Summary
How “N Targets” is determined: A target is included in the validation for a given atlas only if (1) the target’s signature genes overlap sufficiently with the atlas gene expression matrix, and (2) the target gene itself is expressed in enough samples to compute a meaningful Spearman correlation. Targets whose gene is absent or not detected in a dataset are excluded.
Donor-only atlases (CIMA, Inflammation, GTEx, TCGA): N = number of unique targets with valid correlations. CytoSig defines 43 cytokines and SecAct defines 1,170 secreted proteins. The Inflammation Atlas (main/validation cohorts) retains only 33 of 43 CytoSig targets and 805 of 1,170 SecAct targets because 10 cytokine genes (BDNF, BMP4, CXCL12, GCSF, IFN1, IL13, IL17A, IL36, IL4, WNT3A) are not sufficiently expressed in these blood/PBMC samples. CIMA, GTEx, and similar multi-organ datasets retain nearly all targets (≥97%).
Donor-organ atlases (scAtlas Normal, scAtlas Cancer): N = target × organ pairs, because validation is stratified by organ/tissue context. For scAtlas Normal, each target is validated independently across 25 organs (Bladder, Blood, Breast, Colon, Heart, Kidney, Liver, Lung, etc.), yielding up to 43 × 25 = 1,075 CytoSig entries (actual: 1,013 after filtering) and 1,140 × 25 = 28,500 SecAct entries (actual: 27,154). For scAtlas Cancer, validation spans 7 tissue contexts (Tumor, Adjacent, Blood, Metastasis, Pleural Fluids, Pre-Lesion, All), yielding 43 × 7 = 301 CytoSig entries (actual: 295) and 1,140 × 7 = 7,980 SecAct entries (actual: 7,809). Some target-organ pairs are excluded when the target gene lacks sufficient expression in that organ.
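The entry counts quoted above can be checked with a few lines of arithmetic. The sketch uses the target counts exactly as stated in the text (43 CytoSig targets and 1,140 SecAct targets for the scAtlas products) and reports what fraction of the theoretical target × context pairs survives filtering.

```python
# Upper bound = targets x contexts; "retained" counts are the figures quoted above.
entries = [
    # (atlas, method, targets, contexts, retained)
    ("scAtlas Normal", "CytoSig", 43, 25, 1013),
    ("scAtlas Normal", "SecAct", 1140, 25, 27154),
    ("scAtlas Cancer", "CytoSig", 43, 7, 295),
    ("scAtlas Cancer", "SecAct", 1140, 7, 7809),
]

upper_bounds = {}
for atlas, method, targets, contexts, retained in entries:
    upper = targets * contexts
    upper_bounds[(atlas, method)] = upper
    print(f"{atlas:15s} {method:8s} {retained:>6,} / {upper:>6,} = {retained / upper:.1%}")
```

Retention ranges from about 94% (CytoSig, scAtlas Normal) to about 98% (CytoSig, scAtlas Cancer), so only a small share of target × context pairs is lost to the expression filter.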
Note on scAtlas duplicate entries: At finer aggregation levels (e.g., donor_organ_celltype1 vs donor_organ_celltype2), the same target can appear multiple times with different correlation values. This is expected — finer cell-type annotation changes the composition of each pseudobulk sample, yielding different expression-activity relationships. The summary table above uses the donor_organ level for scAtlas.
4.2 Correlation Distributions [Statistical Methods]
Why does SecAct appear to underperform CytoSig in the Inflammation Atlas?
This is a composition effect, not a genuine performance gap, confirmed by two complementary statistical tests:
Total comparison (Mann–Whitney U test): Compares the full ρ distributions of CytoSig (43 cytokine signatures) vs SecAct (~1,170 secreted protein signatures). The Mann–Whitney U test is appropriate because the two target sets are independent with unequal sample sizes: CytoSig and SecAct measure different proteins, so there is no natural pairing. Sample sizes per atlas vary because each target × cell-type/tissue combination contributes one ρ value (e.g., bulk atlases yield ~43 vs ~1,130; scAtlas Normal yields 1,013 vs 27,154 across 25+ tissues). SecAct achieves a significantly higher median ρ in 5 of 6 atlases (CIMA: p = 0.032; GTEx: p = 2.1 × 10⁻³; TCGA: p = 3.9 × 10⁻⁶; scAtlas Cancer: p = 1.1 × 10⁻¹⁰; scAtlas Normal: p = 7.4 × 10⁻³²). The Inflammation Atlas is the sole exception (U = 151,018, p = 0.657, not significant) and the only atlas where CytoSig’s median ρ (0.215) exceeds SecAct’s (0.148).
Matched comparison (Wilcoxon signed-rank test): Restricts to the 22 targets shared between both methods, with each target’s ρ averaged across cell types to yield one paired value. The Wilcoxon signed-rank test is appropriate because this is a paired design: each target serves as its own control, with one ρ from CytoSig and one from SecAct for the same cytokine. SecAct’s median ρ is consistently higher across all 6 atlases, reaching significance in 4 (GTEx: p = 5.3 × 10⁻³; TCGA: p = 8.1 × 10⁻⁵; scAtlas Normal: p = 1.2 × 10⁻⁴; scAtlas Cancer: p = 1.7 × 10⁻³). CIMA is borderline (W = 67, p = 0.054) and the Inflammation Atlas is not significant (W = 86, p = 0.198).
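The two tests described above map onto standard SciPy calls. The sketch below uses synthetic ρ values (randomly generated, not the report's data) purely to show how the unpaired and paired designs differ; the sample sizes of 43, ~1,130, and 22 mirror the figures quoted in the text.

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon

rng = np.random.default_rng(0)

# Total comparison: two independent sets of rho values with unequal sizes
cytosig_rho = rng.normal(0.20, 0.15, size=43)      # synthetic
secact_rho = rng.normal(0.30, 0.15, size=1130)     # synthetic
u_stat, p_total = mannwhitneyu(secact_rho, cytosig_rho, alternative="two-sided")

# Matched comparison: the same 22 shared targets scored by both methods
shared_cytosig = rng.normal(0.25, 0.15, size=22)                   # synthetic
shared_secact = shared_cytosig + rng.normal(0.08, 0.05, size=22)   # synthetic
w_stat, p_matched = wilcoxon(shared_secact, shared_cytosig)

print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_total:.3g}")
print(f"Wilcoxon W = {w_stat:.0f}, p = {p_matched:.3g}")
```

The paired test only sees the 22 per-target differences, so it is unaffected by the much larger number of SecAct-only targets that enter the total comparison.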
The Inflammation Atlas is largely blood-derived, so many SecAct targets that perform well in multi-organ contexts contribute near-zero or negative correlations here. In fact, 99 SecAct targets are negative only in the Inflammation Atlas but positive in all other atlases, reflecting tissue-specific expression limitations rather than inference failure. The matched comparison above is therefore the fairer test, since it places both methods on equal footing.
4.3 Best and Worst Correlated Targets
Consistently well-correlated CytoSig targets (mean ρ > 0.3 across 6 atlases):
- IL1B (mean ρ = 0.55) — canonical inflammatory cytokine, positive in all 6 atlases
- TNFA (mean ρ = 0.50), IFNG (mean ρ = 0.45) — master inflammatory regulators
- IL27 (mean ρ = 0.43), IL1A (mean ρ = 0.42), TGFB3 (mean ρ = 0.39)
- OSM (mean ρ = 0.38), VEGFA (mean ρ = 0.38), IL6 (mean ρ = 0.37), LIF (mean ρ = 0.36)
Poorly correlated CytoSig targets (mean ρ near zero across 6 atlases):
- CD40L (mean ρ = +0.01; range −0.55 to +0.57) — highly atlas-dependent
- TRAIL (mean ρ = +0.00; range −0.55 to +0.58) — highly atlas-dependent
- LTA (mean ρ = −0.02), IL2 (mean ρ = −0.11), IL4 (mean ρ = −0.07)
- HGF (mean ρ = +0.06; range −0.33 to +0.40) — atlas-dependent
Gene mapping verified: All four targets are correctly mapped (CD40L→CD40LG, TRAIL→TNFSF10, LTA→LTA, HGF→HGF). No gene ID confusion exists. The poor correlations reflect specific molecular mechanisms:
| Target | Gene | Dominant Mechanism | Contributing Factors |
|---|---|---|---|
| CD40L | CD40LG | Platelet-derived sCD40L invisible to scRNA-seq (~95% of circulating CD40L); ADAM10-mediated membrane shedding | Unstable mRNA (3′-UTR destabilizing element); transient expression kinetics (peak 6–8h post-activation); paracrine disconnect (T cell → B cell/DC) |
| TRAIL | TNFSF10 | Three decoy receptors (DcR1/TNFRSF10C, DcR2/TNFRSF10D, OPG/TNFRSF11B) competitively sequester ligand without signaling | Non-functional splice variants (TRAIL-beta, TRAIL-gamma lack exon 3) inflate mRNA counts; cathepsin E-mediated shedding; apoptosis-induced survival bias in scRNA-seq data |
| LTA | LTA | Obligate heteromeric complex with LTB: the dominant form (LTα1β2) requires LTB co-expression and signals through LTBR, not TNFR1/2 | Mathematical collinearity with TNFA in ridge regression (LTA3 homotrimer binds the same TNFR1/2 receptors as TNF-α); 7 known splice variants; low/transient expression |
| HGF | HGF | Obligate mesenchymal-to-epithelial paracrine topology: HGF produced by fibroblasts/stellate cells, MET receptor on epithelial cells | Secreted as inactive pro-HGF requiring proteolytic cleavage by HGFAC/uPA (post-translational activation is rate-limiting); ECM/heparin sequestration creates stored protein pool invisible to transcriptomics |
Key insight: None of these targets has isoforms or subunits mapping to different gene IDs that would cause gene ID confusion. The poor correlations are likely driven by post-translational regulation (membrane shedding, proteolytic activation, decoy receptor sequestration), paracrine signaling topology (producer and responder cells are different cell types), and heteromeric complex dependence (LTA requires LTB). These are fundamental limitations of using the correlation between ligand mRNA abundance and predicted activity as a validation strategy for a cytokine activity prediction model.
CytoSig vs SecAct comparison (mean ρ across 6 atlases). Both methods agree on top-performing targets. For some targets where CytoSig shows near-zero correlations, SecAct shows positive correlations:
Consistently well-correlated:

| Target | Gene | CytoSig Mean ρ | SecAct Mean ρ |
|---|---|---|---|
| IL1B | IL1B | +0.552 | +0.573 |
| TNFA | TNF | +0.500 | +0.450 |
| IFNG | IFNG | +0.448 | +0.344 |
| IL27 | IL27 | +0.434 | +0.402 |
| OSM | OSM | +0.384 | +0.567 |

Poorly correlated in CytoSig, higher in SecAct:

| Target | Gene | CytoSig Mean ρ | SecAct Mean ρ |
|---|---|---|---|
| CD40L | CD40LG | +0.011 | +0.443 |
| TRAIL | TNFSF10 | +0.001 | +0.446 |
| LTA | LTA | −0.022 | +0.518 |
| HGF | HGF | +0.061 | +0.559 |
The two methods differ in how their signature matrices are constructed. CytoSig derives signatures from log2 fold-change in cytokine stimulation experiments, providing a controlled differential signal that is less susceptible to correlative noise. SecAct derives signatures from spatial co-expression correlations (Moran’s I across 1,000+ Visium spatial transcriptomics datasets), which may capture co-expression patterns that differential signatures miss but may also reflect indirect or correlative associations. Neither approach is universally superior: CytoSig’s perturbation-based design is cleaner for well-characterized targets, while SecAct’s correlation-based design can recover signal for targets where the mRNA–activity relationship is confounded by post-translational regulation or paracrine topology.
4.4 Cross-Atlas Consistency
Note on Inflammation Atlas variability: The Inflammation Atlas is a combination of three independent cohorts (main, validation, and external). Because of this, it may exhibit larger variation in correlation values compared to the other atlases, which each represent a single cohort.
4.5 Effect of Aggregation Level [Statistical Methods]
Statistical comparison across aggregation levels:
Total comparison (Mann–Whitney U test): Compares the full ρ distributions of CytoSig (43 targets) vs SecAct (~1,170 targets) at each aggregation level. The Mann–Whitney U test is appropriate because the two target sets are independent with unequal sample sizes. SecAct shows significantly higher median ρ at all levels in CIMA (Donor Only: p = 0.032; Donor × L4: p < 10⁻⁹⁶), scAtlas Normal (all p < 10⁻³¹), and scAtlas Cancer (all p < 10⁻¹⁰). The Inflammation Atlas shows no significant difference at any level (p = 0.66, 0.36, 0.19), mirroring the Section 4.2 finding.
Matched comparison (Wilcoxon signed-rank test): Restricts to the 22 targets shared between both methods, with each target’s ρ averaged across cell types to yield one paired value per target at each level. The Wilcoxon signed-rank test is appropriate because this is a paired design. SecAct’s median ρ is consistently higher in scAtlas Normal (Donor × Organ: p < 0.001; CT1: p < 0.001; CT2: p = 0.004) and scAtlas Cancer (all p < 0.002). CIMA shows significance only at Donor × L1 (p = 0.036) with the remaining levels borderline or not significant (p = 0.054–0.098). The Inflammation Atlas shows no significant difference at any level (p = 0.198, 0.337, 0.679), consistent with the Section 4.2 matched comparison.
Trend across all atlases: Both methods’ median correlations decline monotonically with finer aggregation, consistent with the statistical noise introduced when the same cells are divided among more pseudobulk profiles, each built from fewer cells. SecAct’s advantage over CytoSig is maintained or amplified at finer levels in 3 of 4 atlases. The Inflammation Atlas is the exception, likely because it is largely blood-derived, limiting many SecAct targets that depend on tissue-specific expression patterns.
Aggregation levels explained: Pseudobulk profiles are aggregated at increasingly fine cell-type resolution. At coarser levels, each pseudobulk profile averages more cells, yielding smoother expression estimates but masking cell-type-specific signals. At finer levels, each profile is more cell-type-specific but based on fewer cells.
| Atlas | Level | Description | N Cell Types |
|---|---|---|---|
| CIMA | Donor Only | Whole-sample pseudobulk per donor | 1 (all) |
| CIMA | Donor × L1 | Broad lineages (B, CD4_T, CD8_T, Myeloid, NK, etc.) | 7 |
| CIMA | Donor × L2 | Intermediate (CD4_memory, CD8_naive, DC, Mono, etc.) | 28 |
| CIMA | Donor × L3 | Fine-grained (CD4_Tcm, cMono, Switched_Bm, etc.) | 39 |
| CIMA | Donor × L4 | Finest marker-annotated (CD4_Th17-like_RORC, cMono_IL1B, etc.) | 73 |
| Inflammation | Donor Only | Whole-sample pseudobulk per donor | 1 (all) |
| Inflammation | Donor × L1 | Broad categories (B, DC, Mono, T_CD4/CD8 subsets, etc.) | 18 |
| Inflammation | Donor × L2 | Fine-grained (Th1, Th2, Tregs, NK_adaptive, etc.) | 65 |
| scAtlas Normal | Donor × Organ | Per-organ pseudobulk (Bladder, Blood, Breast, Lung, etc.) | 25 organs |
| scAtlas Normal | Donor × Organ × CT1 | Broad cell types within each organ | 191 |
| scAtlas Normal | Donor × Organ × CT2 | Fine cell types within each organ | 356 |
| scAtlas Cancer | Donor × Organ | Per-tissue pseudobulk (Tumor, Adjacent, Blood, Metastasis, etc.) | 7 contexts |
| scAtlas Cancer | Donor × Organ × CT1 | Broad cell types within each tissue context | ~120 |
| scAtlas Cancer | Donor × Organ × CT2 | Fine cell types within each tissue context | ~220 |
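The aggregation levels in the table amount to grouping cells by progressively more label columns. The sketch below builds pseudobulk profiles with NumPy; summing raw counts per group is an assumption (the report does not state whether sums or means are used), and the function name, labels, and counts are illustrative.

```python
import numpy as np

def pseudobulk(counts, *labels):
    """Sum a (cells x genes) count matrix over groups of cells.

    labels: one or more per-cell string label arrays (e.g. donor, cell type);
    cells sharing the same combination of labels form one pseudobulk profile.
    """
    keys = np.array(["|".join(parts) for parts in zip(*labels)])
    groups, inverse = np.unique(keys, return_inverse=True)
    summed = np.zeros((len(groups), counts.shape[1]))
    np.add.at(summed, inverse, counts)  # unbuffered add handles repeated groups
    return groups, summed

# Toy data: 6 cells x 2 genes
counts = np.array([[1, 0], [2, 1], [3, 1], [0, 4], [1, 1], [5, 0]], dtype=float)
donors = ["D1", "D1", "D1", "D2", "D2", "D2"]
cell_types = ["B", "T", "T", "B", "B", "T"]

donor_groups, donor_profiles = pseudobulk(counts, donors)             # Donor Only
fine_groups, fine_profiles = pseudobulk(counts, donors, cell_types)   # Donor x cell type
print(donor_groups, donor_profiles, sep="\n")
print(fine_groups, fine_profiles, sep="\n")
```

In this toy case the donor-only level yields two profiles built from three cells each, while the donor × cell-type level yields four profiles, some built from a single cell, which illustrates why finer levels are noisier.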
4.6 Representative Scatter Plots
4.7 Biologically Important Targets Heatmap
How each correlation value is computed: For each (target, atlas) cell, we compute Spearman rank correlation between predicted cytokine activity (ridge regression z-score) and target gene expression across all donor-level pseudobulk samples. Specifically:
- Pseudobulk aggregation: For each atlas, gene expression is aggregated to the donor level (one profile per donor or donor × cell type).
- Activity inference: Ridge regression (secactpy.ridge, λ=5×10⁵) is applied using the signature matrix (CytoSig: 4,881 genes × 43 cytokines; SecAct: 7,919 genes × 1,170 targets) to predict activity z-scores for each pseudobulk sample.
- Correlation: Spearman ρ is computed between the predicted activity z-score and the original expression of the target gene across all donor-level samples within that atlas. A positive ρ means higher predicted activity tracks with higher target gene expression.
GTEx, TCGA, CIMA, and the Inflammation Atlas use donor-only pseudobulk; scAtlas uses donor × organ.
4.8 Comprehensive Validation Across All Datasets
5. CytoSig vs LinCytoSig vs SecAct Comparison
5.1 Method Overview
| Method | Targets | Genes | Specificity | Selection |
|---|---|---|---|---|
| CytoSig | 43 cytokines | 4,881 curated | Global (all cell types) | — |
| LinCytoSig (orig) | 178 (45 CT × cytokines) | All ~20K | Cell-type specific | Matched cell type |
| LinCytoSig (gene-filtered) | 178 | 4,881 (CytoSig overlap) | Cell-type specific | Matched cell type |
| LinCytoSig Best (combined) | 43 (1 per cytokine) | All ~20K | Best CT per cytokine | Max combined GTEx+TCGA ρ |
| LinCytoSig Best (comb+filt) | 43 (1 per cytokine) | 4,881 (CytoSig overlap) | Best CT per cytokine | Max combined ρ (filtered) |
| LinCytoSig Best (GTEx) | 43 (1 per cytokine) | All ~20K | Best CT per cytokine | Max GTEx ρ |
| LinCytoSig Best (TCGA) | 43 (1 per cytokine) | All ~20K | Best CT per cytokine | Max TCGA ρ |
| LinCytoSig Best (GTEx+filt) | 43 (1 per cytokine) | 4,881 (CytoSig overlap) | Best CT per cytokine | Max GTEx ρ (filtered) |
| LinCytoSig Best (TCGA+filt) | 43 (1 per cytokine) | 4,881 (CytoSig overlap) | Best CT per cytokine | Max TCGA ρ (filtered) |
| SecAct | 1,170 secreted proteins | Spatial Moran’s I | Global (all cell types) | — |
Gene filter: LinCytoSig signatures restricted from ~20K to CytoSig’s 4,881 curated genes. Best selection: For each cytokine, test all cell-type-specific LinCytoSig signatures and select the one with the highest bulk RNA-seq correlation. “Combined” uses pooled GTEx+TCGA; “GTEx” and “TCGA” select independently per bulk dataset. “+filt” variants apply the same cell-type selection but restrict to CytoSig gene space. See LinCytoSig Methodology for details.
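The "Best selection" rule can be sketched as a simple arg-max per cytokine over the cell-type-specific signatures. The function name and the bulk ρ values below are hypothetical and serve only to show the selection logic.

```python
def select_best_signatures(bulk_rho):
    """Pick, for each cytokine, the cell-type signature with the highest bulk rho.

    bulk_rho: {(cytokine, cell_type): rho}, e.g. from GTEx, TCGA, or both pooled.
    Returns {cytokine: (cell_type, rho)}.
    """
    best = {}
    for (cytokine, cell_type), rho in bulk_rho.items():
        if cytokine not in best or rho > best[cytokine][1]:
            best[cytokine] = (cell_type, rho)
    return best

# Hypothetical bulk correlations for two cytokines
bulk_rho = {
    ("IFNG", "NK"): 0.41,
    ("IFNG", "Macrophage"): 0.18,
    ("IL6", "Fibroblast"): 0.33,
    ("IL6", "Monocyte"): 0.29,
    ("IL6", "Endothelial"): 0.12,
}
best = select_best_signatures(bulk_rho)
print(best)
```

Because the same bulk datasets are used both to select and, later, to evaluate the "Best" variants, their bulk-level performance should be read as an optimistic estimate.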
Ten methods compared on identical matched pairs across 4 combined atlases:
- CytoSig — 43 cytokines, 4,881 curated genes, global (all cell types)
- LinCytoSig (orig) — cell-type-matched signatures, all ~20K genes
- LinCytoSig (gene-filtered) — cell-type-matched signatures, restricted to CytoSig’s 4,881 genes
- LinCytoSig Best (combined) — best cell-type signature per cytokine (selected by combined GTEx+TCGA bulk ρ), all ~20K genes
- LinCytoSig Best (comb+filt) — best combined bulk signature, restricted to 4,881 genes
- LinCytoSig Best (GTEx) — best per cytokine selected by GTEx-only bulk ρ, all ~20K genes
- LinCytoSig Best (TCGA) — best per cytokine selected by TCGA-only bulk ρ, all ~20K genes
- LinCytoSig Best (GTEx+filt) — GTEx-selected best, restricted to 4,881 genes
- LinCytoSig Best (TCGA+filt) — TCGA-selected best, restricted to 4,881 genes
- SecAct — 1,170 secreted proteins (Moran’s I), subset matching CytoSig targets
Key findings:
- SecAct achieves the highest median ρ across all 4 combined atlases, benefiting from spatial-transcriptomics-derived signatures.
- CytoSig outperforms most LinCytoSig variants at donor level, with one notable exception: scAtlas Normal Best-orig (0.298) exceeds CytoSig (0.216).
- Gene filtering improves LinCytoSig in most atlases (CIMA +102%, Inflammation Atlas), confirming noise reduction from restricting the gene space.
- GTEx-selected best performs comparably to combined-selected in most atlases but slightly better in scAtlas Cancer (0.300 vs 0.275). TCGA-selected best generally underperforms other selection strategies, suggesting GTEx’s broader tissue coverage provides more generalizable selections.
- Gene filtering of GTEx/TCGA-selected: GTEx+filt and TCGA+filt show mixed results — filtering sometimes improves (e.g., TCGA+filt in Inflammation Atlas: 0.260 vs TCGA-orig 0.168) but can also reduce performance, indicating the optimal gene space depends on both the selection dataset and atlas context.
- General ranking: SecAct > CytoSig > LinCytoSig Best variants > LinCytoSig (filt) > LinCytoSig (orig), though atlas-specific exceptions exist.
5.2 Effect of Aggregation Level
Methodology: At each cell-type aggregation level (CIMA: L1–L4 = 7–73 cell types; Inflammation: L1–L2; scAtlas: CT1–CT2 = coarse/fine), we match CytoSig, LinCytoSig, and SecAct on identical (cytokine, cell type) pairs — using the exact same pseudobulk samples and identical n for all three methods. For each pair, Spearman ρ measures agreement between predicted activity and target gene expression. If lineage-specific aggregation helps, LinCytoSig should increasingly outperform CytoSig as cell-type resolution increases (L1 → L4).
5.2.1 Distribution at Each Level
5.2.2 Summary
n = number of three-way matched pairs. Δρ = LinCytoSig − competitor (negative = LinCytoSig underperforms).
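The summary statistics defined above (n, Δρ, and win/loss counts) can be computed from matched pairs as in the sketch below. The function name, field names, and ρ values are illustrative assumptions.

```python
from statistics import mean

def compare_matched(pairs):
    """Summarize LinCytoSig against one competitor on matched pairs.

    pairs: list of dicts holding rho for the same (cytokine, cell type) pair,
    computed on the same pseudobulk samples for both methods.
    """
    deltas = [p["lincytosig"] - p["competitor"] for p in pairs]
    return {
        "n": len(deltas),
        "mean_delta": mean(deltas),              # negative = LinCytoSig underperforms
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
    }

# Hypothetical matched pairs (competitor = CytoSig)
pairs = [
    {"lincytosig": 0.30, "competitor": 0.35},
    {"lincytosig": 0.10, "competitor": 0.20},
    {"lincytosig": 0.45, "competitor": 0.40},
    {"lincytosig": 0.25, "competitor": 0.25},
]
summary = compare_matched(pairs)
print(summary)
```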
5.2.3 Which Cell Types Benefit?
Aggregated across all atlases at finest celltype level. Green = LinCytoSig wins more; red = LinCytoSig loses more.
5.2.4 Which Cytokines Benefit?
Sorted by mean Δρ vs CytoSig (best to worst).
Key finding: Lineage-specific aggregation provides no systematic advantage at any level.
- At every level, LinCytoSig underperforms CytoSig (mean Δρ ranges from −0.08 at coarse L1 to −0.02 at fine L4 in CIMA). Finer cell types reduce the gap slightly but never close it.
- SecAct wins at every level in CIMA and scAtlas. In Inflammation Atlas L2, LinCytoSig is nearly tied with SecAct (Δρ = +0.01) but still loses to CytoSig.
- Per cell type: Only 5 of 43 cell types show consistent LinCytoSig advantage vs CytoSig (NK Cell, Basophil, DC, Trophoblast, Arterial Endothelial). No cell type beats SecAct.
- Interpretation: CytoSig’s global signature, derived from median log2FC across all cell types, already captures the dominant transcriptional response. Restricting to a single cell type’s response introduces noise from small sample sizes without gaining meaningful lineage specificity. The hypothesis that finer resolution should favor LinCytoSig is not supported by the data.
5.3 SecAct: Breadth Over Depth
- Highest median ρ in organ-level analyses (scAtlas normal: 0.307, cancer: 0.363)
- Highest median ρ in bulk RNA-seq (GTEx: 0.395, TCGA: 0.415)
- 97.1% positive correlation in TCGA
- Wins decisively at celltype level against both CytoSig and LinCytoSig in scAtlas (19/3 wins vs CytoSig in scAtlas Normal, 20/2 in Cancer)
6. Key Takeaways for Scientific Discovery
6.1 What CytoAtlas Enables
- Quantitative cytokine activity per cell type per disease — 43 CytoSig cytokines + 1,170 SecAct secreted proteins across 29M cells
- Cross-disease comparison — same signatures validated across 20 diseases, 35 organs, 15 cancer types
- Independent perturbation comparison — parse_10M provides 90 cytokine perturbations × 12 donors × 18 cell types for independent comparison with CytoSig predictions
- Multi-level validation — donor, donor × celltype, bulk RNA-seq (GTEx/TCGA), and resampled bootstrap validation across 6 atlases
6.2 Limitations
- Linear model: Cannot capture non-linear cytokine interactions
- Transcriptomics-only: Post-translational regulation invisible
- Signature matrix bias: Underrepresented cell types have weaker signatures
- Validation metric: Expression-activity correlation underestimates true accuracy (signatures acting through downstream effectors are not captured)
6.3 Future Directions
- scGPT cohort integration (~35M cells)
- cellxgene Census integration
- Classification of cytokine blocking therapy
7. Appendix: Technical Specifications
A. Computational Infrastructure
- GPU: NVIDIA A100 80GB (SLURM gpu partition)
- Memory: 256–512GB host RAM per node
- Pipeline: 24 Python scripts, 18 pipeline subpackages (~18.7K lines)
- API: 262 REST endpoints across 17 routers
- Frontend: 12 pages, 122 source files, 11.4K LOC
B. Statistical Methods
- Activity inference: Ridge regression (λ=5×10⁵, z-score normalization, permutation-based significance)
- Correlation: Spearman rank correlation
- Multiple testing: Benjamini-Hochberg FDR (q < 0.05)
- Bootstrap: 100–1000 resampling iterations
- Differential: Wilcoxon rank-sum test with effect size
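For the multiple-testing step listed above, a minimal NumPy sketch of the Benjamini-Hochberg adjustment is shown below. The function name and the example p-values are illustrative; the q < 0.05 threshold is the one stated in the list.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order of p_values."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

p_values = [0.001, 0.008, 0.039, 0.041, 0.60]   # illustrative
q_values = benjamini_hochberg(p_values)
significant = q_values < 0.05
print(q_values)      # [0.005, 0.02, 0.05125, 0.05125, 0.6]
print(significant)   # first two tests pass q < 0.05
```

Note that the third and fourth tests have raw p-values below 0.05 but adjusted q-values just above it, which is the kind of borderline case the FDR correction is meant to filter.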